Bayesian multiple-instance motif discovery with BAMBI: inference of recombinase and transcription factor binding sites
Abstract
Finding conserved motifs in genomic sequences is one of the essential problems in bioinformatics. However, achieving high discovery performance without imposing substantial auxiliary constraints on possible motif features remains a key algorithmic challenge. This work describes BAMBI, a sequential Monte Carlo motif-identification algorithm based on a position weight matrix model. The model requires no additional constraints and can estimate motif properties such as length, logo, number of instances and their locations solely from primary nucleotide sequence data. Furthermore, when biologically meaningful information about motif attributes is available, BAMBI uses it to further refine the discovery results. In practical applications, we show that the proposed approach can find the binding sites of DNA-binding molecules as diverse as the cAMP receptor protein (CRP) and Din-family site-specific serine recombinases. As measured by the nucleotide-level correlation coefficient, the results obtained by BAMBI in these and other settings show better statistical performance than any of four widely used profile-based motif discovery methods: MEME, BioProspector with BioOptimizer, SeSiMCMC and Motif Sampler. Additionally, in the case of Din-family recombinase target site discovery, the BAMBI-inferred motif is found to be the only one that is functionally accurate with respect to the underlying biochemical mechanism. C++ and Matlab code is available at http://www.ee.columbia.edu/~guido/BAMBI or http://genomics.lbl.gov/BAMBI/.
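The abstract compares methods by the nucleotide-level correlation coefficient. As a rough illustration only, the sketch below computes that score in the form commonly used in motif-finder benchmarks: a Pearson (Matthews) correlation between known and predicted per-nucleotide site labels. It is an assumption that BAMBI's evaluation uses exactly this definition, and the function names `sites_to_mask` and `nucleotide_cc` are hypothetical, not taken from the BAMBI code.

```python
import math


def sites_to_mask(length, sites):
    """Turn (start, width) site annotations into a 0/1 flag per nucleotide."""
    mask = [0] * length
    for start, width in sites:
        for i in range(start, min(start + width, length)):
            mask[i] = 1
    return mask


def nucleotide_cc(known, predicted):
    """Correlation between two equal-length 0/1 masks of site membership."""
    if len(known) != len(predicted):
        raise ValueError("masks must have the same length")
    pairs = list(zip(known, predicted))
    tp = sum(1 for k, p in pairs if k and p)          # in a site, predicted
    tn = sum(1 for k, p in pairs if not k and not p)  # background, not predicted
    fp = sum(1 for k, p in pairs if not k and p)      # background, predicted
    fn = sum(1 for k, p in pairs if k and not p)      # in a site, missed
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for the degenerate case
    return (tp * tn - fn * fp) / denom


# Toy example: one known 4-nt site at position 5, predicted 2 nt too far right.
known = sites_to_mask(20, [(5, 4)])
predicted = sites_to_mask(20, [(7, 4)])
score = nucleotide_cc(known, predicted)
```

A score of 1 means the predicted nucleotides coincide exactly with the annotated sites, and a score near 0 means the prediction is no better than chance. In the toy example the half-overlapping prediction scores 0.375.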
Similar resources
Bayesian Centroid Estimation for Motif Discovery
Biological sequences may contain patterns that signal important biomolecular functions; a classical example is regulation of gene expression by transcription factors that bind to specific patterns in genomic promoter regions. In motif discovery we are given a set of sequences that share a common motif and aim to identify not only the motif composition, but also the binding sites in each seq...
BayesMD: Flexible Biological Modeling for Motif Discovery
We present BayesMD, a Bayesian Motif Discovery model with several new features. Three different types of biological a priori knowledge are built into the framework in a modular fashion. A mixture of Dirichlets is used as prior over nucleotide probabilities in binding sites. It is trained on transcription factor (TF) databases in order to extract the typical properties of TF binding sites. In a ...
Bayesian Model Based Approaches In The Analysis Of Chromatin Structure And Motif Discovery
RITEN MITRA: Bayesian Model Based Approaches In The Analysis Of Chromatin Structure And Motif Discovery. (Under the direction of Dr P.K.Sen.) Efficient detection of transcription factor (TF) binding sites is an important and unsolved problem in computational genomics. Recently, due to the poor predictive ability of motif finding algorithms, along with the recent proliferation of high-throughput...
The NestedMICA motif inference tool
NestedMICA is a sensitive, scalable pattern-discovery system aimed at finding transcription factor binding sites and similar motifs in biological sequences. More discussion of the principles behind NestedMICA, and an evaluation of its sensitivity, can be found in a recent paper [?]. If you have any problems or questions about the system, please contact Thomas Down.
Heterogeneity in DNA multiple alignments: modeling, inference, and applications in motif finding.
Transcription factors bind sequence-specific sites in DNA to regulate gene transcription. Identifying transcription factor binding sites (TFBSs) is an important step for understanding gene regulation. Although sophisticated in modeling TFBSs and their combinatorial patterns, computational methods for TFBS detection and motif finding often make oversimplified homogeneous model assumptions for ba...